Economical Inversion of Large Text Files

Author

Alistair Moffat

Abstract

To provide keyword-based access to a large text file it is usually necessary to invert the file and create an inverted index that stores, for each word in the file, the paragraph or sentence numbers in which that word occurs. Inverting a large file using traditional techniques may take as much temporary disk space as is occupied by the file itself, and consume a great deal of CPU time. Here we describe an alternative technique for inverting large text files that requires only a nominal amount of temporary disk storage, instead building the inverted index in compressed form in main memory. A program implementing this approach has created a paragraph-level index of a 132 Mbyte collection of legal documents using 13 Mbyte of main memory, 500 Kbyte of temporary disk storage, and approximately 45 CPU-minutes on a Sun SPARCstation 2.

© Computing Systems, Vol. 5, No. 2, Spring 1992, p. 125
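As a rough illustration of the idea in the abstract, the following sketch builds an inverted index in main memory, storing each word's paragraph numbers in compressed form. The abstract does not say which compression code the paper uses, so the gap-plus-variable-byte encoding here is an assumption chosen for brevity, not the author's method. The names `CompressedIndex`, `add_paragraph`, and `lookup` are hypothetical.

```python
from collections import defaultdict


def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, low-order group first."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)   # continuation byte: high bit clear
        n >>= 7
    out.append(n | 0x80)       # final byte of the value: high bit set
    return bytes(out)


def vbyte_decode(data: bytes) -> list[int]:
    """Decode a concatenation of vbyte-encoded integers."""
    values, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:           # final byte: emit the completed value
            values.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return values


class CompressedIndex:
    """Hypothetical in-memory index: word -> compressed paragraph numbers.

    Assumes paragraphs are added in increasing order, numbered from 1.
    """

    def __init__(self) -> None:
        self.postings = defaultdict(bytearray)  # word -> encoded gaps
        self.last = {}                          # word -> last paragraph seen

    def add_paragraph(self, number: int, text: str) -> None:
        for word in set(text.lower().split()):
            # Store the gap from the previous occurrence, not the number
            # itself; gaps are small for frequent words and encode compactly.
            gap = number - self.last.get(word, 0)
            self.postings[word] += vbyte_encode(gap)
            self.last[word] = number

    def lookup(self, word: str) -> list[int]:
        numbers, current = [], 0
        for gap in vbyte_decode(bytes(self.postings.get(word, b""))):
            current += gap
            numbers.append(current)
        return numbers
```

In this sketch nothing is written to disk during construction. Whether a real collection fits in memory depends on the code used and on the text, which the sketch does not attempt to model.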

Similar resources

Efficient single-pass index construction for text databases

Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this article, we review the principal approaches to inversion, analyze their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does ...

Compression and Fast Indexing for Multi-Gigabyte Text Databases

In the last two years we have developed improved techniques for indexing and retrieval of text data, including algorithms for inversion, for compression of the data and index, and for economical ranking. These techniques were, however, tested on relatively small databases. In this paper we describe our experiences in scaling these techniques up to a large (2 Gb) heterogeneous text database. Our...

Phase Inversion in a Batch Liquid – Liquid Stirred System

Phase inversion phenomenon occurs in many industrial processes including liquid-liquid dispersions. Some parameters such as energy input or the presence of mineral compounds in the system affect this phen...

Large-scale Inversion of Magnetic Data Using Golub-Kahan Bidiagonalization with Truncated Generalized Cross Validation for Regularization Parameter Estimation

In this paper a fast method for large-scale sparse inversion of magnetic data is considered. The L1-norm stabilizer is used to generate models with sharp and distinct interfaces. To deal with the non-linearity introduced by the L1-norm, a model-space iteratively reweighted least squares algorithm is used. The original model matrix is factorized using the Golub-Kahan bidiagonalization that proje...

A Superimposed Coding Scheme Based on Multiple Block Descriptor Files for Indexing Very Large Data Bases

A new signature file method for accessing information from large data files containing both formatted and free text data is presented. The new method, called the multiorganizational scheme, is proposed for indexing very large data files containing hundreds of thousands or possibly millions of records.

Journal: Computing Systems

Volume: 5

Year of publication: 1992
